The world indicators dataset compares different countries based on selected attributes. 1. To use K-means and hierarchical clustering methods to group similar countries together 2. To use Internal validation metrics such as Silhouette co-efficient to report the cluster quality 3. Tune the hyperparameters to report the best clustering solution. 4. To give a detailed list of all the groups and the countries included within each group 5. To generate few plots that would explain the characteristics of each cluster
df <- read_csv("World Indicators.csv")
str(df)
## spec_tbl_df [208 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Birth Rate : num [1:208] 0.025 0.046 0.037 0.024 0.042 0.045 0.038 0.035 0.047 0.036 ...
## $ Business Tax Rate : chr [1:208] "72.00%" "52.10%" "65.90%" "19.50%" ...
## $ Days to Start Business: num [1:208] 25 66 29 60 13 13 15 22 55 22 ...
## $ Energy Usage : num [1:208] 41852 13576 3761 2215 NA ...
## $ GDP : chr [1:208] "$199,070,864,638" "$104,115,863,405" "$7,294,900,431" "$15,292,424,757" ...
## $ Health Exp % GDP : num [1:208] 0.044 0.034 0.045 0.052 0.064 0.09 0.054 0.039 0.028 0.036 ...
## $ Health Exp/Capita : chr [1:208] "$233" "$178" "$34" "$404" ...
## $ Hours to do Tax : num [1:208] 451 282 270 152 270 274 654 504 732 100 ...
## $ Infant Mortality Rate : num [1:208] 0.023 0.107 0.06 0.039 0.068 0.059 0.064 0.1 0.092 0.061 ...
## $ Internet Usage : num [1:208] 0.1 0.1 0 0.1 0 0 0.1 0 0 0.1 ...
## $ Lending Interest : num [1:208] 0.08 0.188 NA 0.11 NA 0.132 NA NA NA 0.105 ...
## $ Life Expectancy Female: num [1:208] 72 53 60 46 56 55 55 51 51 62 ...
## $ Life Expectancy Male : num [1:208] 69 50 58 47 55 51 53 47 49 59 ...
## $ Mobile Phone Usage : num [1:208] 0.9 0.5 0.8 1.5 0.5 0.2 0.5 0.2 0.3 0.3 ...
## $ Population 0-14 : num [1:208] 0.272 0.477 0.432 0.34 0.458 0.44 0.432 0.404 0.487 0.422 ...
## $ Population 15-64 : num [1:208] 0.681 0.499 0.539 0.625 0.517 0.535 0.535 0.558 0.488 0.549 ...
## $ Population 65+ : num [1:208] 0.047 0.024 0.029 0.035 0.025 0.025 0.032 0.039 0.025 0.029 ...
## $ Population Urban : num [1:208] 0.682 0.409 0.423 0.565 0.265 0.109 0.521 0.39 0.22 0.28 ...
## $ Region : chr [1:208] "Africa" "Africa" "Africa" "Africa" ...
## $ Country : chr [1:208] "Algeria" "Angola" "Benin" "Botswana" ...
## - attr(*, "spec")=
## .. cols(
## .. `Birth Rate` = col_double(),
## .. `Business Tax Rate` = col_character(),
## .. `Days to Start Business` = col_double(),
## .. `Energy Usage` = col_double(),
## .. GDP = col_character(),
## .. `Health Exp % GDP` = col_double(),
## .. `Health Exp/Capita` = col_character(),
## .. `Hours to do Tax` = col_double(),
## .. `Infant Mortality Rate` = col_double(),
## .. `Internet Usage` = col_double(),
## .. `Lending Interest` = col_double(),
## .. `Life Expectancy Female` = col_double(),
## .. `Life Expectancy Male` = col_double(),
## .. `Mobile Phone Usage` = col_double(),
## .. `Population 0-14` = col_double(),
## .. `Population 15-64` = col_double(),
## .. `Population 65+` = col_double(),
## .. `Population Urban` = col_double(),
## .. Region = col_character(),
## .. Country = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
sort(colSums(is.na(df)))
## Region Country Population Urban
## 0 0 2
## Birth Rate Internet Usage Life Expectancy Female
## 9 9 11
## Life Expectancy Male Mobile Phone Usage Population 0-14
## 11 12 17
## Population 15-64 Population 65+ GDP
## 17 17 20
## Infant Mortality Rate Health Exp % GDP Health Exp/Capita
## 20 23 23
## Business Tax Rate Days to Start Business Hours to do Tax
## 27 27 28
## Energy Usage Lending Interest
## 72 77
Energy Usage and Lending Interest columns have around 35% of NA values. Mostly we will be dropping these columns during our analysis
df$GDP <- parse_number(df$GDP)
df$`Health Exp/Capita` <- parse_number(df$`Health Exp/Capita`)
df$`Business Tax Rate` <- parse_number(df$`Business Tax Rate`)
df_num <- as.data.frame(select_if(df, is.numeric))
corval <- cor(df_num, use="na.or.complete")
corrplot(corval, type="upper")
In the correlation plot above, it can be observed that few columns correlate well with the other columns. This means that few columns (or features) are redundant and can be dropped from the dataframe. Columns such as Birth Rate, Infant Mortality Rate, Internet Usage, Life Expectancy Female, Life Expectancy Male, Population 0-14, Population 15-64 correlate highly with each other. Another such group is Energy Usage and GDP as well as GDP, Health Exp % GDP, Health Exp/Capita. Instead of dropping all of these columns, we have to retain one column to represent these groups. For the last two groups, it is easy to determine the set of columns that we have to retain; we are going to retain GDP from both the groups. However, for the first group, it is difficult to externally determine the representing column. We will retain the column that has the highest correlation value with all other columns. To find out which column has the highest correlation, let’s use the “corval” matrix.
colvec <- c("Birth Rate", "Infant Mortality Rate", "Internet Usage", "Life Expectancy Female",
"Life Expectancy Male", "Population 0-14", "Population 15-64")
cval1 <- corval[colvec, colvec]
sort(colMeans(abs(cval1)), decreasing = TRUE)[1]
## Birth Rate
## 0.8443748
It can be concluded that Birth Rate is the column that we want to retain among these 7 columns.
colvec2 <- c("Infant Mortality Rate", "Internet Usage", "Life Expectancy Female",
"Life Expectancy Male", "Population 0-14", "Population 15-64", "Energy Usage",
"Health Exp % GDP", "Health Exp/Capita")
df2 <- df[, !(names(df) %in% colvec2)]
df_num <- df_num[, !(names(df_num) %in% colvec2)]
sort(colSums(is.na(df2)), decreasing=TRUE)
## Lending Interest Hours to do Tax Business Tax Rate
## 77 28 27
## Days to Start Business GDP Population 65+
## 27 20 17
## Mobile Phone Usage Birth Rate Population Urban
## 12 9 2
## Region Country
## 0 0
coldrop <- c("Lending Interest", "Hours to do Tax", "Business Tax Rate", "Days to Start Business")
df2 <- df2[, !names(df2) %in% coldrop]
df_num <- df_num[, !names(df_num) %in% coldrop]
colmean <- as.list(colMeans(df_num, na.rm=TRUE))
df_num <- replace_na(df_num, colmean)
df2 <- replace_na(df2, colmean)
We have the final set of 5 columns (or features) that we are going to perform clustering. ##### Let us feature scale each of the columns to obtain best clustering result. But before that, let us boxplot each of these features quickly to see if there are any outliers
for (i in (1:5)) {
boxplot(df2[i], xlab=names(df2[i]))
}
Standard Scaler
df2[1:5] <- scale(df2[1:5], center=TRUE, scale=TRUE)
Now we are ready for clustering. However, we need to find the optimal number of clusters first. ##### Let us implement Elbow method and Silhouette method to get the number of clusters.
#kmeans(df2[1:5], )
fviz_nbclust(df2[1:5], FUNcluster = kmeans, method = "wss")
fviz_nbclust(df2[1:5], FUNcluster = kmeans, method = "silhouette")
Both Elbow method (WSS) and Silhouette method suggests that 2 is the optimal number of clusters.
Dunn Index
Silhouette Average Width Co-efficient
centersloop=2:10
dunns <- list()
sse <- list()
ssb <- list()
s1 <- list()
d1 <- dist(df2[1:5], method = "euclidean")
for (centers in centersloop) {
k1 <- kmeans(df2[1:5], centers = centers, iter.max=100000, nstart=200)
sse <- append(sse, k1$tot.withinss) #Sum of Squared Errors within cluster
ssb <- append(ssb, k1$totss) #Group Sum of Squares
dunns <- append(dunns, dunn(clusters=k1$cluster, Data=df2[1:5])) #Dunn index
s1 <- append(s1, summary(silhouette(k1$cluster, d1))$avg.width) #Silhouette Average Width
}
clusters <- 2:10
d1 <- dist(df2[1:5], method = "euclidean")
hc1 <- hclust(d1, method = "complete")
dunns2 <- list()
s2 <- list()
plot(hc1, cex = 0.6, hang = -1)
for (c in clusters) {
hclust1 <- cutree(hc1, k = c)
dunns2 <- append(dunns2, dunn(clusters=hclust1, Data=df2[1:5])) #Dunn index
s2 <- append(s2, summary(silhouette(hclust1, d1))$avg.width) #Silhouette Average Width
}
plot(x=2:10, y=s1, type="b", xlab="Number of Clusters", ylab="Silhouette co-efficient of K means")
plot(x=2:10, y=s2, type="b", xlab="Number of Clusters",
ylab="Silhouette co-efficient of Hierarchical clustering")
#Hierarchical clustering with 2 centers gives erroneous results and we are ignoring it deliberately
The above plots indicate that we can get the best clustering solution by implementing kmeans with center equals 3. ##### Implementing K means with 3 centers.
k1 <- kmeans(df2[1:5], centers = 3, iter.max=100000, nstart=20)
print("Number of countries in each cluster")
## [1] "Number of countries in each cluster"
print(table(k1$cluster))
##
## 1 2 3
## 3 79 126
c1 <- subset(df, k1$cluster == 1)
c2 <- subset(df, k1$cluster == 2)
c3 <- subset(df, k1$cluster == 3)
print("Countries in the First Cluster")
## [1] "Countries in the First Cluster"
print(c1[, c("Country", "Region", "GDP", "Life Expectancy Female", "Birth Rate")])
## # A tibble: 3 × 5
## Country Region GDP `Life Expectancy Female` `Birth Rate`
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 China Asia 7.32e12 76 0.012
## 2 Japan Asia 5.91e12 86 0.008
## 3 United States The Americas 1.55e13 81 0.013
print("Countries in the Second Cluster")
## [1] "Countries in the Second Cluster"
print(c2[, c("Country", "Region", "GDP", "Life Expectancy Female", "Birth Rate")])
## # A tibble: 79 × 5
## Country Region GDP `Life Expectancy F… `Birth Rate`
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Angola Africa 104115863405 53 0.046
## 2 Benin Africa 7294900431 60 0.037
## 3 Burkina Faso Africa 10395757480 56 0.042
## 4 Burundi Africa 2355652064 55 0.045
## 5 Cameroon Africa 25486923059 55 0.038
## 6 Central African Republic Africa 2195599491 51 0.035
## 7 Chad Africa 12156380062 51 0.047
## 8 Comoros Africa 610372697 62 0.036
## 9 Congo, Dem. Rep. Africa 23831631365 51 0.044
## 10 Congo, Rep. Africa 14425606793 59 0.038
## # … with 69 more rows
print("Countries in the Third Cluster")
## [1] "Countries in the Third Cluster"
print(c3[, c("Country", "Region", "GDP", "Life Expectancy Female", "Birth Rate")])
## # A tibble: 126 × 5
## Country Region GDP `Life Expectancy Female` `Birth Rate`
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Algeria Africa 199070864638 72 0.025
## 2 Botswana Africa 15292424757 46 0.024
## 3 Gabon Africa 18796191833 64 0.032
## 4 Libya Africa 34699395524 77 0.022
## 5 Mauritius Africa 11252405860 77 0.011
## 6 Morocco Africa 99211339029 72 0.022
## 7 Seychelles Africa 1059593023 78 0.019
## 8 South Africa Africa 403894316555 57 0.021
## 9 Tunisia Africa 45951129422 77 0.019
## 10 Armenia Asia 10142342770 78 0.014
## # … with 116 more rows
print(k1$centers)
## Birth Rate GDP Mobile Phone Usage Population 65+ Population Urban
## 1 -1.0261167 6.93525546 -0.2195440 1.4622538 0.6809755
## 2 1.0153847 -0.22878806 -0.8812683 -0.8066485 -0.8577265
## 3 -0.6121988 -0.02167865 0.5577685 0.4709402 0.5215672
heatmap(k1$centers, Rowv=NULL, Colv=NULL, xlab="Features", ylab="Clusters")
The above heatmap can give a gist of the characteristics of each of the cluster created by the algorithm
df$Clusters <- k1$cluster
fig1 <- plot_ly(data = df,
x=df$`Birth Rate`, y=df$GDP,
type="scatter", mode="markers",
color = as.factor(df$Clusters),
colors = "Set1",
alpha = 0.7,
text = ~Country,
hovertemplate= '<b>%{text}</b>'
)
fig1 <- fig1 %>%
layout(title = "Birth Rate vs GDP",
xaxis = list(title = 'Birth Rate'),
yaxis = list(title = 'GDP', type="log")
)
fig1
## Warning: Ignoring 24 observations
# ggplot(df, aes(x=df$`Birth Rate`, y=GDP, color=as.factor(Clusters)))+
# geom_point()+
# scale_y_continuous(trans="log", labels=dollar)+
# labs(x="Birth Rate", y="GDP", color="Clusters", title="Birth Rate vs GDP")
#theme(legend.title = "Clusters")
Here, it can be observed that Birth Rate and GDP are inversely related. And the clusters divided can be observed to be as - (The order might differ with each successive runs) 1. Lower GDP and Higher Birth Rate 2. Higher GDP and Lower Birth Rate 3. Medium GDP and Medium Birth Rate
fig1 <- plot_ly(data = df,
x=df$`Health Exp/Capita`, y=df$`Infant Mortality Rate`,
type="scatter", mode="markers",
color = as.factor(df$Clusters),
colors = "Set1",
alpha = 0.7,
text = ~Country,
hovertemplate= '<b>%{text}</b>'
)
fig1 <- fig1 %>%
layout(title = "Health Exp/Capita vs Infant Mortality Rate",
xaxis = list(type="log", title = 'Health Exp/Capita'),
yaxis = list(title = 'Infant Mortality Rate')
)
fig1
## Warning: Ignoring 23 observations
# ggplot(df, aes(x=df$`Health Exp/Capita`, y=df$`Infant Mortality Rate`, color=as.factor(Clusters)))+
# geom_point()+
# scale_x_continuous(trans="log", labels=dollar)+
# labs(x="Health Exp/Capita", y="Infant Mortality Rate", color="Clusters")
# #theme(legend.title = "Clusters")
As one would expect, Infant mortality rate and Health Expenditure Per capita is inversely proportional and again we observe 3 clusters - 1. High IMR and Low Health Expenditure per capita 2. Moderate IMR and Moderate Health Expenditure per capita 3. Low IMR and High Health Expenditure per capita
#df$Clusters <- k1$cluster
fig1 <- plot_ly(data = df,
x=df$`Population Urban`, y=df$GDP,
type="scatter", mode="markers",
color = as.factor(df$Clusters),
colors = "Set1",
alpha = 0.7,
text = ~Country,
hovertemplate= '<b>%{text}</b>'
)
fig1 <- fig1 %>%
layout(title = "Population Urban vs GDP",
xaxis = list(title = 'Population Urban'),
yaxis = list(type="log", title = 'GDP')
)
fig1
## Warning: Ignoring 21 observations
# ggplot(df, aes(x=df$`Population Urban`, y=GDP, color=as.factor(Clusters)))+
# geom_point()+
# scale_y_continuous(trans="log", labels=dollar)+
# labs(x="Population Urban", y="GDP", color="Clusters")
# #theme(legend.title = "Clusters")
A positive trend can be observed between population living in the urban and GDP. The clusters divided can be observed as - 1. Lower population in the urban and Lower GDP 2. Higher population in the urban and Higher GDP 3. Scattered cluster
The clusters divided make sense even in the real world where the top performing countries like USA, China, Japan are in one cluster and then the moderate performing (majority of the European countries) and the poorly performing countries (majority of African countries). As the dataset contains the features that are mostly related to economy and healthy lifestyle, we get these clusters from the algorithm. If we wish to see these based on other factors (natural resources availability, human resources etc.), then we should add more features.